Cluster analysis for portfolio optimization 






Vincenzo Tola, Fabrizio Lillo, Mauro Gallegati, and Rosario N. Mantegna

Dipartimento di Economia, Università Politecnica delle Marche, Piazza Martelli 8, I-60121 Ancona, Italia
Banca d'Italia, Rome, Italy
Dipartimento di Fisica e Tecnologie Relative, Università di Palermo, viale delle Scienze, I-90128 Palermo, Italia, and INFN, Sezione di Catania, Catania, Italy
Santa Fe Institute, 1399 Hyde Park Road, Santa Fe NM 87501, USA

We consider the problem of the statistical uncertainty of the correlation matrix in the optimization 
of a financial portfolio. We show that the use of clustering algorithms can improve the reliability of 
the portfolio in terms of the ratio between predicted and realized risk. Bootstrap analysis indicates 
that this improvement is obtained in a wide range of the parameters N (number of assets) and T 
(investment horizon). The predicted and realized risk level and the relative portfolio composition of
the selected portfolio for a given value of the portfolio return are also investigated for each considered 
filtering method.

I. INTRODUCTION

The problem of portfolio optimization is one of the most important issues in asset management.
Since the seminal work of Markowitz, which solved the problem under a certain number of
simplifying assumptions (see also Section II), many other studies have been devoted to several
aspects of portfolio optimization, from both a theoretical and an applied point of view.
Here we focus our attention on the role of the correlation coefficient matrix in portfolio
optimization. The estimation of the correlation matrix unavoidably carries a statistical
uncertainty, which is due to the finite length of the asset return time series. Recently, there
have been several contributions in the econophysics literature devoted to quantifying the
degree of statistical uncertainty present in a correlation matrix. The results of these
investigations have been obtained by using concepts and tools of random matrix theory (RMT).
The RMT quantification of the statistical uncertainty associated with the estimation of the
correlation coefficient matrix of a finite multivariate time series has recently been used to
devise a procedure to filter the information present in the correlation coefficient matrix.
The filtered information is robust with respect to the unavoidable statistical uncertainty
(in the econophysics literature the term noise dressing has been used for this uncertainty).
The correlation matrices obtained by this filtering procedure have been used in portfolio
optimization. Some studies have shown that, under the assumption of perfect forecasting of
future returns and volatilities, the distance between the predicted optimal portfolio and the
realized one is smaller for the filtered correlation matrix than for the original one at a
given level of the portfolio return.



In recent years, other filtering procedures of the correlation coefficient matrix, performed
using correlation based clustering, have also been proposed in the econophysics literature.
These methods also select information of the correlation coefficient matrix which is
representative of the entire matrix. This information is often less affected by statistical
uncertainty and is therefore more stable than the entire matrix during the time evolution of
the system.



In this paper we investigate how sensitive the portfolio optimization procedure is to
different filtering procedures applied to the correlation coefficient matrix. Specifically,
we consider filtering procedures based on RMT and on correlation based clustering. The paper
is organized as follows. In Section II we briefly describe the mean-variance optimization
problem, define the notation and summarize the problem of the estimation of the correlation
matrix. In Section III we review the recently introduced approach which makes use of RMT to
improve portfolio optimization in the presence of estimation errors due to the finiteness of
the sample data. In Section IV we describe the clustering algorithms used to perform the
portfolio optimization. These algorithms are average linkage and single linkage. In Section V
we describe two methods based on these clustering algorithms to build asset portfolios which
are robust and reliable. Finally, in Section VI we summarize our results and indicate future
work extending and possibly improving our method.

II. PORTFOLIO OPTIMIZATION

A. Markowitz's solution 

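In this section we briefly discuss the basic aspects of Markowitz portfolio optimization.
This is also useful to set the notation and to state the assumptions made and the methods
used. Given $N$ risky assets, the portfolio composition is determined by the weights $p_i$
($i = 1, \ldots, N$) giving the fraction of wealth invested in asset $i$. The weights are
normalized as $\sum_{i=1}^{N} p_i = 1$. The average return and the variance of the portfolio
are

$$r_p = \sum_{i=1}^{N} p_i m_i, \tag{1}$$

$$\sigma_p^2 = \sum_{i=1}^{N} \sum_{j=1}^{N} p_i p_j \sigma_{ij}, \tag{2}$$

where $m_i$ is the mean return of asset $i$ and $\sigma_{ij}$ is the covariance between the
returns of assets $i$ and $j$. The optimization problem consists in finding the vector
$\mathbf{p}$ which minimizes $\sigma_p$ for a given value of $r_p$. We assume that short
selling is allowed, i.e. $p_i$ can assume negative values. As is well known, the solution of
this optimization problem was found by Markowitz and it is

$$\mathbf{p}^* = \lambda \Sigma^{-1} \mathbf{1} + \gamma \Sigma^{-1} \mathbf{m}, \tag{3}$$

where $\Sigma$ is the covariance matrix, $\mathbf{1}^T = (1, \ldots, 1)$ and $\mathbf{m}$ is
the vector of the mean returns of the $N$ assets. The other parameters are fixed by the two
constraints (normalization of the weights and the chosen value of $r_p$):

$$\lambda = \frac{C - r_p B}{\Delta}, \qquad \gamma = \frac{r_p A - B}{\Delta},$$

with $A = \mathbf{1}^T \Sigma^{-1} \mathbf{1}$, $B = \mathbf{1}^T \Sigma^{-1} \mathbf{m}$,
$C = \mathbf{m}^T \Sigma^{-1} \mathbf{m}$ and $\Delta = AC - B^2$.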
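As an illustration of the closed-form solution, the following minimal sketch computes the weights of Eq. (3) with NumPy. It assumes that the two multipliers are obtained by imposing the normalization and target-return constraints on a solution of the form $\lambda \Sigma^{-1}\mathbf{1} + \gamma \Sigma^{-1}\mathbf{m}$. The function name `markowitz_weights` and the three-asset numbers are purely illustrative and are not taken from the dataset used in this paper.

```python
import numpy as np

def markowitz_weights(cov, mean, target_return):
    """Minimum-variance weights for a given expected return (short selling allowed)."""
    cov = np.asarray(cov, dtype=float)
    mean = np.asarray(mean, dtype=float)
    ones = np.ones(len(mean))
    inv_ones = np.linalg.solve(cov, ones)   # Sigma^{-1} 1
    inv_mean = np.linalg.solve(cov, mean)   # Sigma^{-1} m
    a = ones @ inv_ones
    b = ones @ inv_mean
    c = mean @ inv_mean
    delta = a * c - b ** 2
    lam = (c - target_return * b) / delta   # multiplier of Sigma^{-1} 1
    gam = (target_return * a - b) / delta   # multiplier of Sigma^{-1} m
    return lam * inv_ones + gam * inv_mean

# Illustrative (made-up) inputs for three assets.
cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.16]])
mean = np.array([0.05, 0.08, 0.12])
p = markowitz_weights(cov, mean, target_return=0.09)
print("weights:", p)
print("sum of weights:", p.sum(), "portfolio return:", p @ mean)
```

Solving two linear systems with `np.linalg.solve` avoids forming the inverse of the covariance matrix explicitly, which is usually preferable when the matrix is noisy or nearly singular.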

B. Curse of dimensionality and adopted method

Markowitz's solution to the optimization problem relies upon a series of assumptions that
are rarely satisfied in practice. First of all, the asset returns are assumed to be Gaussian
variables, whereas fat tails are observed in price return distributions. Second, the
parameters used in the optimization, i.e. the mean values $\mathbf{m}$ and the covariance
matrix $\Sigma$, are assumed constant. Finally, even if these quantities are really constant
over the time horizon relevant for the problem, their statistical estimation over finite time
intervals $T$ leads to the problem known as the curse of dimensionality. Since the covariance
matrix has $N(N-1)/2 \sim N^2/2$ distinct off-diagonal entries whereas the number of records
used in the estimation is $NT$, one needs time series of length $T \gg N$ in order to have a
small error on the covariance. But for long $T$ non-stationarity becomes more and more
important. For these reasons it is important to develop methods able to filter the part of
the covariance matrix which is less likely to be affected by statistical uncertainty, and to
use (when possible) the filtered information to build portfolios.

In this paper we are mainly concerned with the problems in portfolio optimization due to the
estimation of the correlation matrix, i.e. the matrix whose entries are the correlation
coefficients between the returns of different assets. The correlation coefficient is defined
as $\rho_{ij} = \sigma_{ij} / \sqrt{\sigma_{ii} \sigma_{jj}}$. We therefore use the following
procedure, which has been used to assess the effectiveness of the filtering of the
correlation coefficient matrix based on RMT. Given $N$ assets, a portfolio horizon of $T$
trading days and a time $t_0$ when the optimization is supposed to take place, we compute the
correlation matrix in the $T$ days preceding $t_0$, but we compute the mean returns $m_i$ and
the volatilities $\sigma_i = \sqrt{\sigma_{ii}}$ in the $T$ days following $t_0$. We use
these data to compute the covariance matrix and the predicted optimal portfolio at time
$t_0$. We then compare the predicted risk-return curve with the realized risk-return curve
obtained by computing

$$\hat{\sigma}_p^2 = \sum_{i=1}^{N} \sum_{j=1}^{N} p_i^* p_j^* \hat{\sigma}_{ij}, \tag{4}$$

where $p_i^*$ are the weights obtained in the optimization and $\hat{\sigma}_{ij}$ is the
covariance matrix observed between $t_0$ and $t_0 + T$. By using this procedure we are able
to decouple the problem of estimating cross correlations from the problem of estimating mean
returns and volatilities. In other words, we assume that the investor has a perfect forecast
of $m_i$ and $\sigma_i$ and that all her uncertainty is in the estimation of the cross
correlation matrix.
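The procedure can be sketched in a few lines. The example below is only a schematic illustration on synthetic Gaussian returns: the array names `past` and `future`, the helper names and the equal weights are assumptions made for the example, and in the procedure described above the weights would come from the optimization of Eq. (3).

```python
import numpy as np

def covariance_from(corr, vol):
    """Build a covariance matrix from correlations and volatilities."""
    vol = np.asarray(vol, dtype=float)
    return np.asarray(corr, dtype=float) * np.outer(vol, vol)

def portfolio_risk(weights, cov):
    """Standard deviation of the portfolio return for given weights."""
    return float(np.sqrt(weights @ cov @ weights))

rng = np.random.default_rng(0)
T, N = 250, 5
past = rng.standard_normal((T, N))     # returns in the T days before t0
future = rng.standard_normal((T, N))   # returns in the T days after t0

corr_past = np.corrcoef(past, rowvar=False)     # the only "uncertain" input
vol_future = future.std(axis=0, ddof=1)         # perfect forecast of volatilities
cov_pred = covariance_from(corr_past, vol_future)
cov_real = np.cov(future, rowvar=False)         # covariance observed after t0

weights = np.full(N, 1.0 / N)   # placeholder weights, for illustration only
predicted = portfolio_risk(weights, cov_pred)
realized = portfolio_risk(weights, cov_real)
gap = abs(realized - predicted)
print(predicted, realized, gap)
```

Because the volatilities are taken from the future window, the predicted and realized covariance matrices share the same diagonal and differ only through the correlation coefficients.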

In order to quantify and compare the quality of different filtering methods we make use of
three measures. The first quantity, measuring the reliability of the portfolio, is obtained
by comparing, for a given value of expected return, the risk $\sigma_p$ predicted by using
the past correlation matrix with the realized risk $\hat{\sigma}_p$ of Eq. (4). A portfolio
is more reliable when



$$\mathcal{R} = \frac{|\hat{\sigma}_p - \sigma_p|}{\hat{\sigma}_p} \tag{5}$$



is small. We have also used different measures of the reliability, such as
$|\hat{\sigma}_p - \sigma_p|$, obtaining similar results.

The second quantity used to compare different methods is simply the realized risk
$\hat{\sigma}_p$. Clearly a portfolio is less risky than another one when its realized risk
is smaller. Note that in general a portfolio with a small risk is not necessarily better
than a riskier portfolio. In fact, if the uncertainty on the risk of the safe portfolio is
very large, an investor could face large fluctuations and therefore a larger loss.

The third characteristic for evaluating portfolio optimization methods is the degree of
reduction in the effective dimension of the portfolio. Dealing with a large portfolio can be
very costly because of the transaction costs that the investor has to face any time she wants
to rebalance the weights. Even if we do not consider here the problem of portfolio
rebalancing and benchmarking, we wish to quantify the "effective" number of stocks in which a
significant amount of money is invested. We quantify this number as

$$N^{(\mathrm{eff})} = \frac{1}{\sum_{i=1}^{N} p_i^2}. \tag{6}$$

This quantity is equal to 1 when all the wealth is invested in only one asset, whereas it is
equal to $N$ when the wealth is divided equally among the $N$ assets, i.e. $p_i = 1/N$. It
may be worth noting that the quantity $N^{(\mathrm{eff})}$ does not give the number of assets
in which a non-vanishing amount of wealth is invested. It simply gives a rough estimate of
the number of assets that could effectively be used to build a smaller portfolio with
risk-return properties not too far from those of the original $N$ asset portfolio.
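The two limiting cases quoted above are easy to check numerically. The short sketch below assumes that the effective size is the inverse of the sum of squared weights, which is the form consistent with both limits. The function name `effective_size` is chosen only for this example.

```python
def effective_size(weights):
    """Inverse of the sum of squared portfolio weights."""
    return 1.0 / sum(w * w for w in weights)

concentrated = effective_size([1.0, 0.0, 0.0, 0.0])   # all wealth in one asset
equal = effective_size([0.25, 0.25, 0.25, 0.25])      # wealth split equally
two_assets = effective_size([0.5, 0.5, 0.0, 0.0])     # two assets out of four
print(concentrated, equal, two_assets)
```

The third case shows why the measure is described as "effective": a four-asset portfolio that holds only two assets in equal parts has an effective size of two.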

In the next sections, the different filtering procedures considered in this paper are
investigated by using a data set of 1071 stocks continuously traded at the New York Stock
Exchange (NYSE) during the period 1988-1998. In this study, we consider daily returns.

III. RANDOM MATRIX THEORY APPROACH

FIG. 1: The continuous lines are the predicted risk and the dashed lines are the realized
risk. The circles refer to the Markowitz portfolio optimization, whereas the squares are the
predicted and realized risk curves obtained by filtering the correlation matrix with the
Random Matrix Theory approach. We assume that the only uncertainty of the investor is on the
correlation matrix. The dataset is composed of the 150 most capitalized stocks at NYSE in
the period 1989-1992. The first two years are used for the estimation of the correlation
matrix and the other two years are the investment period.



Recently it has been shown that RMT can be useful to investigate the properties of return
correlation matrices of financial assets. The simplest random matrix is a matrix of given
type and size whose entries consist of random numbers from some specified distribution. RMT
was originally developed in nuclear physics and then applied to many different fields. In
the context of asset portfolios RMT is useful because it allows one to compute the effect of
statistical uncertainty in the estimation of the correlation matrix. Suppose that the $N$
assets are described by $N$ time series of length $T$ and that the returns are independent
Gaussian random variables with zero mean and variance $\sigma^2$. The correlation matrix of
this set of variables in the limit $T \to \infty$ is simply the identity matrix. When $T$ is
finite the correlation matrix will in general be different from the identity matrix. RMT
allows one to prove that in the limit $T, N \to \infty$, with a fixed ratio $Q = T/N > 1$,
the eigenvalue spectral density of the covariance matrix is given by

$$\rho(\lambda) = \frac{Q}{2\pi\sigma^2} \frac{\sqrt{(\lambda_{\max} - \lambda)(\lambda - \lambda_{\min})}}{\lambda}, \tag{7}$$

where $\lambda_{\min}^{\max} = \sigma^2 (1 + 1/Q \pm 2\sqrt{1/Q})$. The spectral density is
different from zero in the interval $]\lambda_{\min}, \lambda_{\max}[$. In the case of a
correlation matrix $\sigma^2 = 1$. The spectrum described by Eq. (7) is different from
$N\delta(\lambda - 1)$, which is expected for an identity correlation matrix. In other words,
RMT quantifies the role of the finiteness of the length of the time series on the spectral
properties of the correlation matrix.
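To give a feeling for the width of the noise band, the sketch below evaluates the edges of the spectrum and the density for a given $Q$. It assumes the standard form of the random-matrix spectral density for uncorrelated Gaussian series, with edges $\sigma^2(1 + 1/Q \pm 2\sqrt{1/Q})$. The function names `mp_bounds` and `mp_density` are illustrative.

```python
import math

def mp_bounds(q, sigma2=1.0):
    """Lower and upper edges of the random-matrix eigenvalue spectrum."""
    root = 2.0 * math.sqrt(1.0 / q)
    return sigma2 * (1.0 + 1.0 / q - root), sigma2 * (1.0 + 1.0 / q + root)

def mp_density(lam, q, sigma2=1.0):
    """Spectral density at eigenvalue lam (zero outside the edges)."""
    lo, hi = mp_bounds(q, sigma2)
    if lam <= lo or lam >= hi:
        return 0.0
    return q / (2.0 * math.pi * sigma2) * math.sqrt((hi - lam) * (lam - lo)) / lam

# Example with the sizes quoted for Fig. 1: T = 500 days, N = 150 stocks.
q_example = 500 / 150
print(mp_bounds(q_example))
```

Even with more than three records per asset, the band of eigenvalues compatible with pure noise is wide, which is why a large part of the empirical spectrum cannot be distinguished from estimation noise.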



RMT has been applied to correlation matrices of returns of financial assets and it has been
shown that the spectrum of a typical portfolio can be divided into three classes of
eigenvalues. The largest eigenvalue is totally incompatible with Eq. (7) and describes the
common behavior of the stocks composing the portfolio. A fraction of the order of 5% of the
eigenvalues is also incompatible with RMT because these eigenvalues fall outside the interval
$]\lambda_{\min}, \lambda_{\max}[$. These eigenvalues probably describe economic information
stored in the correlation matrix. The remaining large part of the eigenvalues lies between
$\lambda_{\min}$ and $\lambda_{\max}$, and thus one cannot say whether any information is
contained in the corresponding eigenspace.

The fact that by using RMT it is possible, under certain assumptions, to identify the noisy
part of the correlation matrix led several authors to use RMT in the optimization of
financial portfolios. Specifically, the suggested method is the following.

One computes the correlation matrix and finds its spectrum, ranking the eigenvalues such that
$\lambda_k > \lambda_{k+1}$. One then computes the variance of the part not explained by the
highest eigenvalue as $\sigma^2 = 1 - \lambda_1/N$ and uses this value in Eq. (7) to compute
$\lambda_{\min}$ and $\lambda_{\max}$. One then constructs a filtered diagonal matrix by
setting to zero all the eigenvalues smaller than $\lambda_{\max}$ and leaving the remaining
ones unaltered. Finally, one obtains the filtered correlation matrix by transforming the
filtered diagonal matrix back to the original basis. In order to obtain a meaningful
correlation matrix we set the diagonal elements of the filtered correlation matrix to one.
This matrix preserves only the information of the original correlation matrix that RMT
recognizes as signal. It has been shown that the portfolio obtained by using the filtered
correlation matrix has a smaller value of $\mathcal{R}$ than a portfolio with weights
obtained with Markowitz's procedure and the whole correlation matrix. As an example we show
in Figure 1 the predicted and realized risk for a portfolio of the 150 most capitalized
stocks traded at NYSE in a period of $T = 500$ trading days. In this and in the other figures
the risk and return are computed on a yearly time horizon. We have estimated the correlation
matrix in the two year period 1989-1990 and the realized risk is computed in the two year
period 1991-1992. The figure shows the risk-return curve for the Markowitz portfolio and for
a portfolio obtained by filtering the correlation matrix with the RMT method outlined above.
We note that for all values of the expected return the parameter $\mathcal{R}$ for the RMT
portfolio is significantly smaller than for the Markowitz portfolio. For this portfolio the
realized risk of the RMT portfolio is smaller than the realized risk of the Markowitz
portfolio. This is in agreement with what, for example, Rosenow et al. find for another set
of data. We wish to point out that the behavior of Fig. 1 is rather common in different
portfolios. However, we have found that for some portfolios the realized risk profile
obtained with the RMT filtering is larger than the one obtained with the Markowitz approach.
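The filtering recipe described in this section can be condensed into a short routine. What follows is a minimal sketch under the assumptions stated in the text (noise variance estimated from the largest eigenvalue, eigenvalues below the upper edge set to zero, unit diagonal restored). The function name `rmt_filter` and the synthetic one-factor data are illustrative and are not the NYSE data used in the paper.

```python
import numpy as np

def rmt_filter(corr, T):
    """Filter a correlation matrix estimated from T records per asset."""
    corr = np.asarray(corr, dtype=float)
    N = corr.shape[0]
    q = T / N
    eigval, eigvec = np.linalg.eigh(corr)        # eigenvalues in ascending order
    sigma2 = 1.0 - eigval[-1] / N                # variance not explained by the largest one
    lam_max = sigma2 * (1.0 + 1.0 / q + 2.0 * np.sqrt(1.0 / q))
    kept = np.where(eigval > lam_max, eigval, 0.0)   # zero the "noisy" eigenvalues
    filtered = (eigvec * kept) @ eigvec.T        # back to the original basis
    np.fill_diagonal(filtered, 1.0)              # restore unit diagonal
    return filtered

# Synthetic returns driven by one common factor plus idiosyncratic noise.
rng = np.random.default_rng(1)
T, N = 400, 20
factor = rng.standard_normal((T, 1))
returns = 0.7 * factor + rng.standard_normal((T, N))
corr = np.corrcoef(returns, rowvar=False)
filtered = rmt_filter(corr, T)
print(np.round(filtered[:3, :3], 3))
```

Restoring the unit diagonal keeps the filtered matrix positive semidefinite, because the retained part has diagonal elements not larger than one and the correction only adds non-negative terms on the diagonal.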



IV. CLUSTERING ALGORITHMS 

In this paper, we introduce a new portfolio optimization technique which is based on
clustering algorithms. Clustering is a common practice in multivariate data analysis [36].
The purpose of clustering analysis is to obtain a meaningful partition of a set of $N$
variables into groups according to their characteristics. For example, in correlation based
clustering algorithms (adopted here) the correlation coefficient between two time series is
taken as a measure of the similarity between the two time series. Correlation based
clustering has recently been used to infer the hierarchical structure of a portfolio of
stocks from its correlation coefficient matrix. Correlation based clustering may be seen as
a filtering procedure, i.e. a matrix transformation retaining a smaller number of distinct
elements: after its application one usually retains a subset of the distinct elements
composing the correlation coefficient matrix. For example, in the single linkage clustering
algorithm [37] the number of distinct elements present in the filtered matrix is $n - 1$,
whereas the number of distinct elements present in the original matrix is $n(n-1)/2$, where
$n$ denotes the number of elements (here the $N$ assets). The selection of these $n - 1$
elements is done according to a widespread algorithm. A possible conceptual description of
the algorithm is the following. Let us assume that a similarity measure $S$ between pairs of
elements is defined, e.g. the correlation coefficient between pairs of elements of the
system. An ordered list $S_{ord}$ of pairs of elements can be constructed by arranging them
in descending order according to the value of the similarity $s_{ij}$ between element $i$
and element $j$. Different elements are iteratively included in clusters starting from the
first two elements of the ordered list. At each step, when two elements, or one element and
a cluster, or two clusters $p$ and $q$ merge into a wider single cluster $t$, the similarity
or distance between the new cluster $t$ and any other cluster $r$ is determined as follows.
If $s_{ij}$ is a correlation-like measure,

$$s_{tr} = \max\{s_{pr}, s_{qr}\}, \tag{8}$$

indicating that the similarity between any element of cluster $t$ and any element of cluster
$r$ is the similarity between the two most similar entities in clusters $t$ and $r$.
Conversely, if $s_{ij}$ is a distance-like measure,

$$s_{tr} = \min\{s_{pr}, s_{qr}\}. \tag{9}$$

By iteratively applying this procedure, $n - 1$ of the $n(n-1)/2$ distinct elements of the
correlation coefficient matrix are selected. When a distance-like measure is used, for
example $d_{ij} = \sqrt{2(1 - \rho_{ij})}$, the distance matrix obtained by applying the
single linkage procedure is an ultrametric matrix comprising the $n - 1$ distinct selected
elements. Ultrametric distances $d^{<}_{ij}$ are distances satisfying the inequality
$d^{<}_{ac} \le \max\{d^{<}_{ab}, d^{<}_{bc}\}$, which is stronger than the customary
triangular inequality $d_{ac} \le d_{ab} + d_{bc}$. In particular, the single linkage
clustering procedure is associated with an ultrametric correlation coefficient matrix which
is the subdominant ultrametric matrix of the original correlation coefficient matrix. A
didactic description of the method used to obtain the ultrametric matrix is given in the
cited literature.
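For small matrices the subdominant ultrametric can be obtained without building the clusters explicitly. The sketch below does not follow the merge-by-merge description literally: it uses an equivalent formulation, in which the single linkage ultrametric distance between two elements is the smallest possible value of the largest step along any path connecting them. The function names and the 4 x 4 correlation matrix are made up for illustration.

```python
import math

def correlation_distance(rho):
    """Distance associated with a correlation coefficient."""
    return math.sqrt(2.0 * (1.0 - rho))

def subdominant_ultrametric(dist):
    """Single linkage ultrametric distances via minimax paths (O(n^3))."""
    n = len(dist)
    ultra = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(ultra[i][k], ultra[k][j])   # largest step when passing through k
                if via_k < ultra[i][j]:
                    ultra[i][j] = via_k
    return ultra

corr = [[1.0, 0.8, 0.3, 0.2],
        [0.8, 1.0, 0.4, 0.1],
        [0.3, 0.4, 1.0, 0.6],
        [0.2, 0.1, 0.6, 1.0]]
dist = [[correlation_distance(r) for r in row] for row in corr]
ultra = subdominant_ultrametric(dist)
# Map the ultrametric distances back to correlation coefficients.
ultra_corr = [[1.0 - d * d / 2.0 for d in row] for row in ultra]
for row in ultra_corr:
    print([round(x, 3) for x in row])
```

In this example elements 0 and 1 merge first, then elements 2 and 3, and the two clusters are finally joined at the largest correlation between them, so that only three distinct off-diagonal values survive out of the original six.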

It has been proved that the ultrametric correlation matrix obtained by the single linkage
clustering procedure of the correlation coefficient matrix is always positive definite when
all the elements of the obtained ultrametric correlation matrix are positive. This condition
is rather common in financial data of stock portfolios and it has been observed in all the
investigations we have performed so far. The effectiveness of the single linkage clustering
procedure in pointing out the hierarchical structure of the investigated portfolio has been
shown by several studies. However, the single linkage is just one possible correlation based
filtering method. Other methods have also been applied to financial portfolios. Each method
puts a specific emphasis on some aspects of the original matrix and is usually able to point
out a series of aspects that might not be elucidated by a different filtering procedure. The
choice of the filtering method must therefore be guided by the specific goals that one
pursues. In the present study, we decide to consider the average linkage procedure in
addition to the single linkage procedure. The average linkage is another widespread
clustering algorithm [43]. The difference with the single linkage algorithm is that the
similarity measure between an element and the closest cluster is given by the mean
similarity measure between the considered element and each element of the closest cluster.
In other words, if $s_{ij}$ is a similarity-like measure, at each stage one obtains the
similarity $s_{tr}$ between clusters $t$ and $r$, defined as above, as the average
similarity over all pairs of elements belonging to the two clusters. A detailed discussion
of the average linkage cluster algorithm can be found in the clustering literature. Also in
the case of the average linkage the filtered correlation coefficient matrix is ultrametric.
It has been proved that the ultrametric correlation coefficient matrix associated with the
average linkage clustering procedure is positive definite under the same general conditions
valid for the case of the single linkage. However, this property is not generic to all
clustering procedures. We have verified that it does not apply to the complete linkage and
to the Ward clustering method.
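A compact way to see how the average linkage filter acts is to build the filtered matrix directly from the merge history. The sketch below is a plain, unoptimized implementation written for illustration. It assumes that the similarity between two clusters is the mean correlation over all pairs of their elements, and it includes a small Cholesky-based check of positive definiteness. All names and the 4 x 4 example matrix are hypothetical.

```python
def average_linkage_ultrametric(corr):
    """Ultrametric correlation matrix associated with average linkage."""
    n = len(corr)
    clusters = [[i] for i in range(n)]
    ultra = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

    def avg_sim(a, b):
        return sum(corr[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        s, x, y = best
        for i in clusters[x]:            # every cross pair gets the merge similarity
            for j in clusters[y]:
                ultra[i][j] = ultra[j][i] = s
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return ultra

def is_positive_definite(matrix):
    """Attempt a Cholesky factorization; it succeeds only for positive definite input."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                chol[i][i] = s ** 0.5
            else:
                chol[i][j] = s / chol[j][j]
    return True

corr = [[1.0, 0.8, 0.3, 0.2],
        [0.8, 1.0, 0.4, 0.1],
        [0.3, 0.4, 1.0, 0.6],
        [0.2, 0.1, 0.6, 1.0]]
ultra = average_linkage_ultrametric(corr)
print(ultra, is_positive_definite(ultra))
```

Compared with single linkage on the same matrix, the two clusters are joined at the mean of the four cross correlations rather than at the largest one, which illustrates how the two filters weigh the inter-cluster information differently.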



V. PORTFOLIO OPTIMIZATION WITH
CLUSTERING ALGORITHMS

FIG. 2: The continuous lines are the predicted risk and the dashed lines are the realized
risk. The filled blue circles refer to the portfolio obtained with the average linkage
method. The empty circles refer to the Markowitz portfolio optimization, whereas the squares
are the predicted and realized risk curves obtained by filtering the correlation matrix with
the Random Matrix Theory approach (same data as in Fig. 1). We assume that the only
uncertainty of the investor is on the correlation matrix. The dataset is composed of the 150
most capitalized stocks at NYSE in the period 1989-1992. The first two years are used to
estimate the correlation matrix and the other two years are the investment period.



The portfolio optimization method we propose here is based on the use of the ultrametric
matrix associated with a given clustering method as a meaningful and robust filtration of
the original correlation matrix. In other words, we construct the portfolio by solving the
Markowitz optimization problem with the ultrametric matrix rather than with the original
correlation matrix or the RMT filtered matrix. The reasons for this choice are: (i) it is
known that clustering algorithms are able to filter the relevant information in a
multivariate set of data; (ii) other studies indicate that clustering algorithms are quite
robust with respect to measurement noise due to the finiteness of the sample size. This is
particularly true for sets of variables that are hierarchically organized.

From the filtered (ultrametric) correlation matrix we build the portfolio by using the
Markowitz result (Eq. (3)) and for each value of the portfolio return we find the predicted
risk $\sigma_p$. We note that in order to consider the ultrametric matrix as a meaningful
correlation matrix it is important that the matrix is positive definite (or semidefinite).
We have performed a very large number of portfolio optimizations using real data and we have
not found a single case in which the ultrametric matrix is not positive definite. We used
the weights obtained from the optimization to compute the realized risk by using Eq. (4),
where $\hat{\sigma}_{ij}$ is the original covariance matrix. In other words, we use the
filtered matrix only for obtaining the weights $p_i$, whereas the realized risk is clearly
determined by the whole correlation matrix.



A. Average linkage 

We first consider portfolios built by using correlation matrices filtered with the average
linkage cluster algorithm.



1. Reliability 

Figure 2 shows the predicted and realized risk for the portfolio obtained with the average
linkage, considering the same set of stocks and the same time period as in Fig. 1. The
distance between predicted and realized risk for the portfolio obtained with average linkage
is significantly smaller than the distance for the portfolio obtained with RMT. This result
clearly indicates that the use of clustering methods to build financial portfolios is able
to provide portfolios that are more reliable (in terms of the error in the forecasted risk)
than the ones obtained with RMT and with Markowitz optimization. We also note that for this
set of data the realized risk of the portfolio obtained with the clustering method is almost
always smaller than the realized risk of the RMT portfolio.

FIG. 3: Density plot showing the percentage of success of the average linkage portfolio
optimization technique over the Random Matrix Theory approach as a function of the number of
assets $N$ and the investment period $T$. The white area, corresponding to cases where
$T < N + 50$, is not considered in our investigation.

In order to verify the robustness of these results we have performed an extensive bootstrap
experiment. We have considered many different values of the portfolio size $N$ and of the
investment horizon $T$, and for each pair $(N, T)$ we have randomly sampled 50 portfolios
composed of $N$ stocks and we have randomly selected 50 initial times $t_0$. For each
portfolio we have considered 10 values of the expected portfolio return $r_p$. Specifically,
we have taken ten equispaced values of $r_p$ between the value of $r_p$ associated with the
absolute minimum risk and the highest expected return among the $N$ stocks of the portfolio.
For each expected return and for each portfolio we have computed the parameter $\mathcal{R}$
and we have counted the fraction of times that $\mathcal{R}_{av.link} < \mathcal{R}_{RMT}$,
i.e. the percentage of cases in which the portfolio obtained with average linkage is more
reliable than the portfolio obtained with Random Matrix Theory. The result of this analysis
is shown in Figure 3. The figure indicates that the average linkage portfolio outperforms
the RMT portfolio for almost any value of $N$ and $T$. For $N \approx 350$ and
$T \approx 500$ the average linkage portfolio is more reliable than the RMT portfolio more
than 85% of the time. The reliability of the average linkage portfolio is higher when the
number of stocks is large. For small size ($N < 50$) portfolios the two methods are
statistically equally reliable.
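Two ingredients of this experiment are simple enough to sketch: the grid of expected returns and the success fraction. The snippet below is only an illustration. It assumes that the return at the absolute minimum risk is the return of the global minimum-variance portfolio, and the function names, the three-asset inputs and the reliability values are invented for the example.

```python
import numpy as np

def target_return_grid(cov, mean, n_values=10):
    """Equispaced returns from the minimum-risk return to the largest mean return."""
    cov = np.asarray(cov, dtype=float)
    mean = np.asarray(mean, dtype=float)
    ones = np.ones(len(mean))
    inv_ones = np.linalg.solve(cov, ones)                  # Sigma^{-1} 1
    r_min_risk = (mean @ inv_ones) / (ones @ inv_ones)     # return of the minimum-risk portfolio
    return np.linspace(r_min_risk, mean.max(), n_values)

def success_fraction(rel_a, rel_b):
    """Fraction of cases in which method A has the smaller reliability parameter."""
    return float(np.mean(np.asarray(rel_a) < np.asarray(rel_b)))

cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.16]])
mean = np.array([0.05, 0.08, 0.12])
grid = target_return_grid(cov, mean)
fraction = success_fraction([0.1, 0.3, 0.2], [0.2, 0.2, 0.5])   # made-up values
print(grid, fraction)
```

In a full bootstrap the success fraction would be accumulated over all sampled portfolios, initial times and values of the expected return for each pair of portfolio size and investment horizon.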



2. Riskiness 

We now compare the realized risk of the portfolios obtained with the two methods, i.e. RMT
and average linkage. The realized risk is a measure of the riskiness of the portfolio. We
observe that small size portfolios tend to be less risky when obtained with the average
linkage, whereas as $N$ increases the RMT portfolios become less risky. The boundary between
the two regions is approximately at $N \approx 75$. By comparing this result with Figure 3
we see that when the average linkage is more reliable it is also riskier, and vice versa.
There is a small region around $N \approx 50$ where it is possible to build portfolios with
average linkage which are reliable and not too risky.

It is important to stress that, as for Figure 3, the above result on riskiness is obtained
by putting together all the values of the portfolio expected return $r_p$. On the other
hand, we find that the riskiness of the average linkage portfolio compared to the RMT
portfolio strongly depends on $r_p$, especially for large portfolios. Specifically, when we
consider large portfolios ($50 < N < 500$) we find that for small $r_p$ the average linkage
portfolio is less risky than the RMT portfolio in only about 25% of the cases. When $r_p$ is
large this fraction is of the order of 45%. In other words, for portfolios with large $r_p$
the average linkage portfolios are approximately as risky as the RMT portfolios.

3. Effective size 

Finally we consider the effective size $N^{(\mathrm{eff})}$ of the portfolio as quantified
by Eq. (6). We consider three portfolio sizes, i.e. $N = 50$, $300$ and $500$, and we select
two values of the portfolio expected return $r_p$, i.e. the minimum value (corresponding to
the minimum risk) and an intermediate value between the minimum and the maximum. Figure 4
shows $N^{(\mathrm{eff})}$ as a function of the investment horizon $T$. Similar results are
observed for high values of expected return, but in this case the dimensionality of the
portfolio becomes smaller and smaller and the wealth is more and more concentrated in the
asset with the highest return. We note that for small portfolios ($N = 50$) the effective
size of RMT portfolios is slightly smaller than the effective size of average linkage
portfolios. On the other hand, for larger portfolios the effective size of average linkage
portfolios is significantly smaller than the effective size of RMT portfolios. This result
shows that portfolios built with average linkage have a smaller effective dimensionality,
i.e. the maintenance cost of these portfolios is smaller than that of RMT portfolios.

B. Single linkage 

1. Reliability 

We performed the same analysis by using a different clustering algorithm, specifically the
single linkage cluster analysis. By using the same data as in Figures 1 and 2 we compute the
curves for predicted and realized risk for a portfolio built by using the ultrametric matrix
associated with the single linkage algorithm. The result is shown in Fig. 5. Also in this
case the predicted and realized risk for the single linkage portfolio are significantly
closer than the corresponding quantities for the Markowitz and RMT portfolios. In this case
it is more evident that the realized risk of the single linkage portfolio is larger than the
other two realized risks. Thus the single linkage portfolio is riskier but more reliable
when compared with the RMT portfolio.

FIG. 4: Effective size $N^{(\mathrm{eff})}$ of the portfolio as defined in Eq. (6) as a
function of the investment horizon $T$. The black circles refer to the RMT portfolio and the
red squares to the average linkage portfolio. The portfolio size is $N = 50$ (panels (a) and
(b)), $N = 300$ (panels (c) and (d)) and $N = 500$ (panels (e) and (f)). The left panels (a,
c, and e) refer to the minimum value of the portfolio expected return $r_p$ and the right
panels (b, d, and f) refer to an intermediate value of $r_p$. Every point is the average
over 50 realizations obtained by bootstrapping and the error bars are



FIG. 6: Density plot showing the percentage of success of the single linkage portfolio
optimization technique over the Random Matrix Theory approach as a function of the number of
assets $N$ and the investment period $T$. The white area, corresponding to cases where
$T < N + 50$, is not considered in our investigation.



Also for the single linkage method we perform a bootstrap
analysis similar to the one described above for the average
linkage method, in order to compare the reliability of the
single linkage method with that of the RMT method. The
result is the density plot shown in Fig. 6. We find that the
single linkage method provides more reliable portfolios in
wide ranges of the parameters N and T. This is increasingly
evident for portfolios with T ~ N, i.e. portfolios for which
the investment horizon (in trading days) is comparable with
the portfolio size N. It is interesting to note that for
these portfolios the effect of the measurement noise
("noise dressing") is particularly high.
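
To illustrate how a "percentage of success" over bootstrap replicas
can be tallied, the Python sketch below counts the replicas in which
one method has a smaller gap between realized and predicted risk
than the other. The gap |realized - predicted| is a stand-in for the
reliability measure, which is not reproduced in this section, and
the function name and the sample numbers are hypothetical.

```python
import numpy as np


def success_percentage(pred_a, real_a, pred_b, real_b):
    """Percentage of replicas in which method A is more reliable than B.

    Reliability is proxied here by |realized - predicted| risk
    (an assumption made for this sketch).
    """
    gap_a = np.abs(np.asarray(real_a, dtype=float) - np.asarray(pred_a, dtype=float))
    gap_b = np.abs(np.asarray(real_b, dtype=float) - np.asarray(pred_b, dtype=float))
    return 100.0 * np.mean(gap_a < gap_b)


# Hypothetical risks from four bootstrap replicas (not data from the paper).
pct = success_percentage(
    pred_a=[0.10, 0.10, 0.10, 0.10], real_a=[0.12, 0.11, 0.20, 0.13],
    pred_b=[0.08, 0.08, 0.08, 0.08], real_b=[0.15, 0.14, 0.10, 0.16],
)
```

Repeating this count for each pair (N, T) would give one cell of a
density plot of the kind described above.
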




FIG. 5: The continuous lines are the predicted 
risk and the dashed lines are the realized risk. The 
filled green diamonds refer to the portfolio obtained 
with the single linkage method. The empty cir- 
cles refer to the Markowitz portfolio optimization, 
whereas the squares are the predicted and real- 
ized risk curves obtained by filtering the correla- 
tion matrix with the Random Matrix Theory ap- 
proach (same data as in fig. 0. We assume that 
the only uncertainty of the investor is on the correlation
matrix. The dataset is composed of the 150 most capitalized
stocks on the NYSE in the period 1989-1992. The first two
years are used to estimate the correlation matrix and the
other two years are the investment period.



2. Riskiness 

The analysis of the riskiness, i.e. the realized risk, of
single linkage portfolios shows that these portfolios are
systematically riskier than RMT portfolios. This is also
seen in the example shown in Fig. 5. Only for very small
sizes (N < 15) are the single linkage portfolios less risky
than RMT portfolios. As for the average linkage portfolio,
this effect strongly depends on the portfolio expected
return. When we consider large portfolios (50 < N < 500)
we find that for small r_p the single linkage portfolio is
less risky than the RMT portfolio in only ~ 0.3% of the
cases. When r_p is large this fraction is of the order of
10%. In any case these figures indicate that single linkage
portfolios are risky, even if they can be quite reliable.
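
The comparison between predicted and realized risk can be sketched
in Python as follows. The example assumes a Markowitz problem with
a budget constraint and a target return, with short selling
allowed, and assumes that expected returns and volatilities are
known so that only the correlation matrix differs between the
estimation and investment periods. All function names and sample
values are hypothetical.

```python
import numpy as np


def markowitz_weights(cov, mean, target):
    """Minimum-variance weights with sum(w) = 1 and w @ mean = target.

    Short selling is allowed (assumption); the weights solve the KKT
    linear system of the equality-constrained quadratic program.
    """
    n = len(mean)
    kkt = np.zeros((n + 2, n + 2))
    kkt[:n, :n] = 2.0 * cov
    kkt[:n, n] = 1.0       # multiplier of the budget constraint
    kkt[:n, n + 1] = mean  # multiplier of the return constraint
    kkt[n, :n] = 1.0
    kkt[n + 1, :n] = mean
    rhs = np.zeros(n + 2)
    rhs[n] = 1.0
    rhs[n + 1] = target
    return np.linalg.solve(kkt, rhs)[:n]


def portfolio_risk(weights, cov):
    """Standard deviation of the portfolio return."""
    return float(np.sqrt(weights @ cov @ weights))


def predicted_and_realized_risk(corr_est, corr_real, vol, mean, target):
    """Risk under the estimated and the realized correlation matrix."""
    scale = np.outer(vol, vol)
    weights = markowitz_weights(corr_est * scale, mean, target)
    predicted = portfolio_risk(weights, corr_est * scale)
    realized = portfolio_risk(weights, corr_real * scale)
    return predicted, realized


vol = np.array([0.20, 0.30, 0.25])
mean = np.array([0.05, 0.10, 0.08])
corr_est = np.eye(3)
corr_real = np.array([[1.0, 0.5, 0.5],
                      [0.5, 1.0, 0.5],
                      [0.5, 0.5, 1.0]])
predicted, realized = predicted_and_realized_risk(
    corr_est, corr_real, vol, mean, target=0.08)
```

Passing a filtered correlation matrix as corr_est gives the
predicted risk of the corresponding filtered portfolio, while
corr_real plays the role of the correlation matrix measured in the
investment period.
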
FIG. 7: Effective size $N^{(\mathrm{eff})}$ of the portfolio,
as defined in the text, as a function of the investment
horizon T. The black circles refer to the RMT portfolio and
the red squares to the single linkage portfolio. The
portfolio size is N = 50 (panels (a) and (b)), N = 300
(panels (c) and (d)) and N = 500 (panels (e) and (f)).
The left panels (a, c and e) refer to the minimum value of
the portfolio expected return r_p and the right panels
(b, d and f) refer to an intermediate value of r_p. Every
point is the average over 50 realizations obtained by
bootstrapping and the error bars are standard errors.



3. Effective size 

The effective size $N^{(\mathrm{eff})}$ of the single linkage
portfolio, as quantified by the defining equation introduced
earlier, shows interesting properties. Figure 7 shows the
effective size for different portfolio conditions and should
be compared with Figure 4. We see that for any value of the
portfolio size N the effective size of the single linkage
portfolio is significantly smaller than that of the RMT
portfolio. This effect is observed also for small portfolios,
in contrast with what we observe for average linkage
portfolios. Even more important, for large portfolios (where
size reduction is an important issue) the single linkage
portfolios have an effective size which is roughly half the
effective size of the RMT portfolio. This result suggests
that single linkage portfolios could be used to detect a
small subset of stocks which is representative of the whole
portfolio, and thus to replace the original portfolio with
another one of significantly smaller size and reduced
maintenance costs. This possibility will be explored in a
future work.
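
As a small illustration of how an effective size responds to the
concentration of wealth, the Python sketch below assumes the
inverse participation ratio 1 / sum_i w_i^2, a common choice for
this quantity. The defining equation is not reproduced in this
section, so this form and the function name are assumptions.

```python
import numpy as np


def effective_size(weights):
    """Inverse participation ratio of the portfolio weights (assumed form)."""
    w = np.asarray(weights, dtype=float)
    return 1.0 / np.sum(w**2)


even = effective_size([0.25, 0.25, 0.25, 0.25])    # wealth spread evenly
uneven = effective_size([0.97, 0.01, 0.01, 0.01])  # wealth concentrated
```

With this form, an evenly spread portfolio of four stocks has an
effective size of 4, while a portfolio dominated by a single stock
has an effective size close to 1.
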



VI. CONCLUSIONS 

In this paper we have performed portfolio optimization 
by using filtered correlation coefficient matrices. These 
matrices have been obtained by applying different filtering
methods to the original correlation coefficient matrix. We
have proposed two filtering methods based on the average
linkage and single linkage clustering procedures. The
optimal portfolios obtained with these two
new methods have been compared with the one based on 
RMT recently proposed in Refs fHH^. 

A large set of simulations has shown that clustering
methods outperform RMT filtering when we consider the
reliability of the estimation, i.e. the agreement between
the realized and the predicted portfolio, for portfolios
with a number of assets roughly in the range 50 < N < 500.
Hence, for relatively large portfolios the clustering
filtering methods provide a more reliable estimation of the
predicted risk-return profile, both with respect to the
Markowitz basic estimation and with respect to the
determination of the correlation coefficients done with the
RMT filtering.

The portfolios obtained with the average linkage show a
predicted and realized risk-return profile which often lies
inside the corresponding profiles obtained both with the
Markowitz basic estimation and after the RMT filtering. In
the case of the single linkage clustering method the
risk-return profile shows risk levels which are
systematically higher than the ones obtained both with the
Markowitz basic estimation and after the RMT filtering.
Therefore, with respect to the level of risk associated
with the selected portfolios, the most successful methods
are the average linkage and the RMT filtering.

Another aspect investigated in our study refers to the
composition of the selected portfolios. We have quantified
the degree of homogeneity of the distribution of the wealth
across the stocks of the portfolio through what we have
called the "effective size" of the portfolio. A small value
of this parameter indicates an uneven distribution of the
portfolio wealth, suggesting that during portfolio
re-balancing only a subset of stocks will be significantly
involved. The investigation of the "effective size" of the
portfolio has shown that the average linkage and the RMT
are characterized by not too different values of the
"effective size". In fact, for small portfolios
(e.g. N = 50) the RMT has for most values of T a smaller
value of the "effective size", whereas the pattern is
reversed for medium (N = 300) and large portfolios
(N = 500), both for the minimum and for the intermediate
value of r_p. The pattern is clearly different for the case
of the single linkage filtering. In this case the
"effective size" is always significantly smaller than the
one observed in the case of RMT filtering.

The above discussion shows that the different filtering
procedures provide different portfolio optimization results
that are characterized by specific strengths or weaknesses.
In other words, for each value of N and T, the most useful
filtered correlation coefficient matrix can be different
depending on the strongest constraint the investor has among
the risk level of the portfolio, the reliability of the
estimation and the portfolio "effective size". We believe
that the two clustering methods we have proposed here and
the RMT are not exhaustive with respect to all potential
aspects of portfolio optimization and, probably, other
filtering methods could also provide very interesting
results in specific regimes of the different control
parameters.

The different results of the different filtering methods
raise the scientific question of what causes the differences
among the various filtering procedures. A precise
quantification of the information retained by the different
filtered matrices would be very useful. This goal is left
for future research.